Fix exception ref-count leak in non-ref catch handlers. #12860

Closed
cfallin wants to merge 2 commits into bytecodealliance:main from cfallin:exception-leak

Conversation

@cfallin (Member) commented Mar 27, 2026

When a Wasm `throw` instruction executes, the `throw_ref` libcall was cloning the GC ref (incrementing the refcount), but the catch handler never decremented it. This caused every caught exception to leak, leading to unbounded GC heap growth in throw/catch loops.

Two fixes:

  1. Remove the unnecessary clone_gc_ref() in throw_ref. The throw/throw_ref instructions consume the exnref operand, so ownership transfers naturally to pending_exception without cloning.

  2. In create_catch_block, emit a drop_gc_ref call for non-ref catches (Catch::One, Catch::All) after field extraction. These catches consume the exnref without passing it to the branch target, so the refcount must be decremented.

Also adds Store::gc_heap_size() / StoreContext::gc_heap_size() accessors and a throw_catch_many_times integration test that throws and catches 100K exceptions in a loop, asserting the GC heap stays within a single 64 KiB page.

@cfallin cfallin requested review from a team as code owners March 27, 2026 23:17
@cfallin cfallin requested review from fitzgen and removed request for a team March 27, 2026 23:17
@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Mar 28, 2026
Comment on lines +465 to +489
```rust
/// Drop a GC reference to an exception object, decrementing its ref count.
///
/// This should be called when a `catch` (non-ref) handler has finished
/// extracting fields from the caught exnref and no longer needs it.
pub fn translate_drop_exnref(
    func_env: &mut FuncEnvironment<'_>,
    builder: &mut FunctionBuilder<'_>,
    exnref: ir::Value,
) -> WasmResult<()> {
    log::trace!("translate_drop_exnref({exnref:?})");
    match func_env.tunables.collector {
        #[cfg(feature = "gc-drc")]
        Some(Collector::DeferredReferenceCounting) => {
            let drop_gc_ref_libcall = func_env.builtin_functions.drop_gc_ref(builder.func);
            let vmctx = func_env.vmctx_val(&mut builder.cursor());
            builder.ins().call(drop_gc_ref_libcall, &[vmctx, exnref]);
        }
        // The null collector doesn't track ref counts, so nothing to do.
        _ => {
            // Avoid unused-parameter warning.
            let _ = builder;
        }
    }
    Ok(())
}
```
Member:

A couple things:

  • If we add a new collector that requires some kind of explicit drop here, we will silently repeat this bug because of the wildcard. We shouldn't have wildcards for this kind of thing.
  • Even better would be to move this into a trait method on the GC compiler, as the dual of GcCompiler::alloc_exn.

```rust
} else {
    // For non-ref catches, the exnref is consumed here and not
    // passed to the branch target.
    environ.translate_drop_exnref(builder, exn_ref)?;
```
Member:
I don't quite understand. The GC ref should logically be held alive by the Wasm stack (in the over-approximation table) and then get dropped upon the next GC. Why wasn't that happening? Are we missing a call to `expose_gc_ref_to_wasm` in the throw code? (Probably)

If so, then we should do that instead.

Member Author:

We do call `expose_gc_ref_to_wasm` here when we manually convert to a `wasmtime::ValRaw`, but you're right, I think it's missing in the throw path. I'll dig deeper -- thanks!

@cfallin (Member Author) commented Mar 30, 2026

So digging into this a bit more I see that we always `expose_gc_ref_to_wasm` in the `gc_alloc_raw` libcall, which is used both by the exn throw path and other allocation paths (e.g. for GC structs): here

The root issue is actually that the growth heuristic for the GC heap creates the appearance of a leak (and, arguably, as we play with semantics, maybe a leak in practice?). The problem I'm trying to solve is that I want setjmp/longjmp with exceptions to be robust and lightweight; that's what the test in this PR is made to emulate. If we continually throw and catch, without holding the exnref, we have at most one object live at a time. A theoretically ideal GC should somehow see that and keep the working-set size small.

I had short-circuited the "deferred" part of DRC here with explicit drops, which I agree is the wrong approach now that I dig in more. But it should also be the case that a C program using SjLj with exceptions should not grow the GC heap to its max size (4 GiB, say?) before ever collecting -- that seems like an objectively bad heuristic choice.

Perhaps what we need is a collection heuristic like what we do for OwnedRooted, where we have a watermark based on the actual size after last collection, and collect again at twice that size? That allows growth up to a large-working-set regime without more than amortized-constant overhead, but should still bound the working set size pretty tightly. What do you think @fitzgen?

@fitzgen (Member) commented Mar 30, 2026

> So digging into this a bit more I see that we always `expose_gc_ref_to_wasm` in the `gc_alloc_raw` libcall, which is used both by the exn throw path and other allocation paths (e.g. for GC structs): here

It needs to be called every time you pass a reference from the host to Wasm. For example, it looks like it is missing here:

```rust
#[cfg(feature = "gc")]
fn throw_ref(
    store: &mut dyn VMStore,
    _instance: InstanceId,
    exnref: u32,
) -> Result<(), TrapReason> {
    let exnref = VMGcRef::from_raw_u32(exnref).ok_or_else(|| Trap::NullReference)?;
    let exnref = store.unwrap_gc_store_mut().clone_gc_ref(&exnref);
    let exnref = exnref
        .into_exnref(&*store.unwrap_gc_store().gc_heap)
        .expect("gc ref should be an exception object");
    store.set_pending_exception(exnref);
    Err(TrapReason::Exception)
}
```

I'm pretty sure that adding a call to `expose_gc_ref_to_wasm` there will fix the leak.

> The root issue is actually that the growth heuristic for the GC heap creates the appearance of a leak (and, arguably as we play with semantics, maybe a leak in practice?) The problem I'm trying to solve is that I want setjmp/longjmp with exceptions to be robust and lightweight; that's what the test in this PR is made to emulate. If we continually throw and catch, without holding the exnref, we have only at most one object live at a time. A theoretically ideal GC should somehow see that and keep the working-set size small.

Right, I agree that we need to fix the heuristics for heap growth vs collection, but I don't think we should be special-casing exnrefs.

Instead, we should fix the heuristics and make it so that exnrefs Just Work the way we want them to, and that falls out automatically from updating those heuristics. That, and then also the other stuff in #11256, in the fullness of time.

In the meantime, you can update the test to use a resource limiter, custom memory creator, or the pooling allocator to constrain the max size of memories (and therefore also the GC heap) and it should still exercise the leak-checking even without the grow-or-collect heuristics fixed. The pooling allocator is probably the easiest.

@cfallin (Member Author) commented Mar 30, 2026

> It needs to be called every time you pass a reference from the host, to Wasm. For example, it looks like it is missing here:

I'm a bit confused: the code that you link is a libcall that takes a ref from Wasm and pulls it into the host (to save on the store). Did you mean to link somewhere else?

Separately: writing a variant of this test, that allocates a GC struct and immediately drops it in a loop, causes the GC heap to grow without collection up to its size limit in the same way. That seems to confirm to me that this is not exn-specific but rather a general GC growth heuristic problem?

> Right, I agree that we need to fix the heuristics for heap growth vs collection, but I don't think we should be special-casing exnrefs.

Yes, agreed; the next paragraph of my earlier comment suggests a growth heuristic (for the whole GC heap, not just for exnref-specific cases), exactly as you're saying. (In other words: I am definitely not proposing an exceptions-specific change.) I would imagine it could/should become the default growth heuristic. What do you think of the proposed heuristic?

@fitzgen (Member) commented Mar 30, 2026

Ah sorry I got confused because I had assumed that this was not about the heuristics and was something specific to the interaction between exnref and libcalls, and I was trying not to complicate the discussion by getting into multiple things at once. That backfired :-p

> Perhaps what we need is a collection heuristic like what we do for OwnedRooted, where we have a watermark based on the actual size after last collection, and collect again at twice that size? That allows growth up to a large-working-set regime without more than amortized-constant overhead, but should still bound the working set size pretty tightly. What do you think @fitzgen?

Yes, this is roughly what I've been imagining, although there is the minor wrinkle of growing in units of pages, but wanting to do the accounting at a finer-grained level so that the single-page GC heap use case works correctly.

To be precise, what I've had in my head is this:

  • After each collection we record the live set size somewhere -- maybe GcStore?
  • When deciding whether to GC or grow the heap, we look at the last live set size, and if it is less than or equal to half our GC heap size, then we do a collection. Otherwise, we grow.
  • The initial live set size is zero, which causes us to always GC before growing, which should preserve the single-page GC heap use case, because if there is only ever one exnref live then our live set should always end up being below half the GC heap size.

This should also avoid doing a bunch of GCs at each ~power of two on the way up to 64 KiB, which would just waste cycles.

How does that sound?

@fitzgen (Member) commented Mar 30, 2026

> After each collection we record the live set size somewhere -- maybe GcStore?

And because refcounting operates on anti-matter rather than matter, this does unfortunately mean that the DRC collector will need to always have a running count of allocated bytes, rather than being able to compute it just at GC time like the copying collector will be able to. Ah well, not a big deal.

@cfallin (Member Author) commented Mar 30, 2026

That does sound reasonable, and I'm happy to work on that -- thanks! Closing this in the meantime (I'll bring back both the sjlj exception test and struct-alloc-then-drop test once we have said heuristic and assert they both stay within one page).

@cfallin cfallin closed this Mar 30, 2026